A Common Misassumption in Online Experiments with Machine Learning Models
Online experiments such as Randomised Controlled Trials (RCTs) or A/B-tests
are the bread and butter of modern platforms on the web. They are conducted
continuously to allow platforms to estimate the causal effect of replacing
system variant "A" with variant "B", on some metric of interest. These variants
can differ in many aspects. In this paper, we focus on the common use-case
where they correspond to machine learning models. The online experiment then
serves as the final arbiter to decide which model is superior, and should thus
be shipped.
The statistical literature on causal effect estimation from RCTs has a
substantial history, which contributes deservedly to the level of trust
researchers and practitioners have in this "gold standard" of evaluation
practices. Nevertheless, in the particular case of machine learning
experiments, we remark that certain critical issues remain. Specifically, the
assumptions required to ascertain that A/B-tests yield unbiased
estimates of the causal effect are seldom met in practical applications. We
argue that, because variants typically learn using pooled data, a lack of model
interference cannot be guaranteed. This undermines the conclusions we can draw
from online experiments with machine learning models. We discuss the
implications this has for practitioners, and for the research literature.
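To make the estimand concrete: under the no-interference assumption the paper scrutinises, the difference in group means of an A/B-test is an unbiased estimate of the causal effect of replacing variant "A" with variant "B". A minimal sketch, with the sample size and effect size purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative synthetic outcomes: per-user reward under variant A vs B.
n = 10_000
true_effect = 0.05
y_a = rng.normal(loc=1.00, scale=1.0, size=n)                # variant A group
y_b = rng.normal(loc=1.00 + true_effect, scale=1.0, size=n)  # variant B group

# Difference-in-means estimator of the average treatment effect (ATE).
ate_hat = y_b.mean() - y_a.mean()

# Standard error and a 95% confidence interval (normal approximation).
se = np.sqrt(y_a.var(ddof=1) / n + y_b.var(ddof=1) / n)
ci = (ate_hat - 1.96 * se, ate_hat + 1.96 * se)
print(f"ATE estimate: {ate_hat:.4f}, 95% CI: ({ci[0]:.4f}, {ci[1]:.4f})")
```

The unbiasedness of this estimator hinges on outcomes in one group being unaffected by the other group's variant; when both variants learn from pooled data, exactly that independence is in doubt.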
Offline Recommender System Evaluation under Unobserved Confounding
Off-Policy Estimation (OPE) methods allow us to learn and evaluate
decision-making policies from logged data. This makes them an attractive choice
for the offline evaluation of recommender systems, and several recent works
have reported successful adoption of OPE methods to this end. An important
assumption that makes this work is the absence of unobserved confounders:
random variables that influence both actions and rewards at data collection
time. Because the data collection policy is typically under the practitioner's
control, the unconfoundedness assumption is often left implicit, and its
violations are rarely dealt with in the existing literature.
This work aims to highlight the problems that arise when performing
off-policy estimation in the presence of unobserved confounders, specifically
focusing on a recommendation use-case. We focus on policy-based estimators,
where the logging propensities are learned from logged data. We characterise
the statistical bias that arises due to confounding, and show how existing
diagnostics are unable to uncover such cases. Because the bias depends directly
on the true and unobserved logging propensities, it is non-identifiable. As the
unconfoundedness assumption is famously untestable, this becomes especially
problematic. This paper emphasises this common, yet often overlooked issue.
Through synthetic data, we empirically show how naïve propensity estimation
under confounding can lead to severely biased metric estimates that are allowed
to fly under the radar. We aim to cultivate an awareness among researchers and
practitioners of this important problem, and touch upon potential research
directions towards mitigating its effects.
Comment: Accepted at the CONSEQUENCES'23 workshop at RecSys '23
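The failure mode described above can be reproduced in a few lines. The sketch below is illustrative, not the paper's experimental setup: a binary unobserved confounder drives both the logged action and the reward, so propensities learned from the logs alone look uniform, and naïve Inverse Propensity Scoring (IPS) is severely biased, while an oracle with the true conditional propensities recovers the target policy's value.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Unobserved confounder influences both the logged action and the reward.
u = rng.integers(0, 2, size=n)
# Logging policy conditions on u: P(a=1 | u=1) = 0.9, P(a=1 | u=0) = 0.1.
p_a1_given_u = np.where(u == 1, 0.9, 0.1)
a = rng.random(n) < p_a1_given_u
# Reward is 1 iff the action matches the confounder.
r = (a.astype(int) == u).astype(float)

# Target policy to evaluate: uniform over both actions -> true value is 0.5.
pi_t = 0.5

# Naive propensity estimation from the logs alone (u is unobserved):
# the marginal P(a=1) is ~0.5, hiding the confounding entirely.
p_hat = np.where(a, a.mean(), 1 - a.mean())

# IPS with the learned (confounded) propensities: biased towards ~0.9.
ips_naive = np.mean(pi_t / p_hat * r)

# Oracle IPS with the true conditional propensities (unavailable in practice).
p_true = np.where(a, p_a1_given_u, 1 - p_a1_given_u)
ips_oracle = np.mean(pi_t / p_true * r)

print(f"true value: 0.5 | naive IPS: {ips_naive:.3f} | oracle IPS: {ips_oracle:.3f}")
```

Note that no diagnostic on the logged data alone distinguishes the learned marginal propensities from the true conditional ones, which is the non-identifiability the abstract refers to.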
On (Normalised) Discounted Cumulative Gain as an Off-Policy Evaluation Metric for Top-n Recommendation
Approaches to recommendation are typically evaluated in one of two ways: (1)
via a (simulated) online experiment, often seen as the gold standard, or (2)
via some offline evaluation procedure, where the goal is to approximate the
outcome of an online experiment. Several offline evaluation metrics have been
adopted in the literature, inspired by ranking metrics prevalent in the field
of Information Retrieval. (Normalised) Discounted Cumulative Gain (nDCG) is one
such metric that has seen widespread adoption in empirical studies, and higher
(n)DCG values have been used to present new methods as the state-of-the-art in
top-n recommendation for many years.
Our work takes a critical look at this approach, and investigates when we can
expect such metrics to approximate the gold standard outcome of an online
experiment. We formally present the assumptions that are necessary to consider
DCG an unbiased estimator of online reward and provide a derivation for this
metric from first principles, highlighting where we deviate from its
traditional uses in IR. Importantly, we show that normalising the metric
renders it inconsistent, in that even when DCG is unbiased, ranking competing
methods by their normalised DCG can invert their relative order. Through a
correlation analysis between off- and on-line experiments conducted on a
large-scale recommendation platform, we show that our unbiased DCG estimates
strongly correlate with online reward, even when some of the metric's inherent
assumptions are violated. This statement no longer holds for its normalised
variant, suggesting that nDCG's practical utility may be limited.
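The inconsistency of the normalised metric is easy to exhibit. In the illustrative two-user example below (binary relevance, standard logarithmic discount; the numbers are ours, not the paper's), system A wins on average DCG while system B wins on average nDCG:

```python
import numpy as np

def dcg(relevances):
    """DCG with the standard logarithmic position discount."""
    ranks = np.arange(1, len(relevances) + 1)
    return float(np.sum(np.asarray(relevances) / np.log2(ranks + 1)))

def ndcg(relevances, n_relevant):
    """Normalise by the ideal DCG for this user."""
    return dcg(relevances) / dcg([1] * n_relevant)

# Two users: user 1 has three relevant items, user 2 has only one,
# so their ideal DCGs differ. Relevance of each system's top-3 ranking:
system_A = {1: [1, 1, 1], 2: [0, 0, 0]}  # perfect for user 1, misses user 2
system_B = {1: [0, 0, 1], 2: [1, 0, 0]}  # weak for user 1, perfect for user 2
n_rel = {1: 3, 2: 1}

avg_dcg_A = np.mean([dcg(system_A[u]) for u in (1, 2)])
avg_dcg_B = np.mean([dcg(system_B[u]) for u in (1, 2)])
avg_ndcg_A = np.mean([ndcg(system_A[u], n_rel[u]) for u in (1, 2)])
avg_ndcg_B = np.mean([ndcg(system_B[u], n_rel[u]) for u in (1, 2)])

print(f"DCG : A={avg_dcg_A:.3f}  B={avg_dcg_B:.3f}")   # A > B
print(f"nDCG: A={avg_ndcg_A:.3f}  B={avg_ndcg_B:.3f}")  # B > A: order inverted
```

The inversion arises because normalisation reweights users by their ideal DCG: user 2's single relevant item counts as much as user 1's three once normalised.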
RecFusion: A Binomial Diffusion Process for 1D Data for Recommendation
In this paper we propose RecFusion, which comprises a set of diffusion models
for recommendation. Unlike image data, which contain spatial correlations, a
user-item interaction matrix, commonly utilized in recommendation, lacks
spatial relationships between users and items. We formulate diffusion on a 1D
vector and propose binomial diffusion, which explicitly models binary user-item
interactions with a Bernoulli process. We show that RecFusion approaches the
performance of complex VAE baselines on the core recommendation setting (top-n
recommendation for binary non-sequential feedback) and the most common datasets
(MovieLens and Netflix). Our proposed diffusion models that are specialized for
1D and/or binary setups have implications beyond recommendation systems, such
as in the medical domain with MRI and CT scans.
Comment: code: https://github.com/gabriben/recfusio
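The Bernoulli forward process that binomial diffusion builds on can be sketched as follows; the constant noise schedule, vector size, and number of steps are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_step(x, beta):
    """One binomial forward-diffusion step: each entry of the binary
    vector x is kept with probability (1 - beta) and resampled from a
    fair coin with probability beta, i.e.
    q(x_t = 1 | x_{t-1}) = x_{t-1} * (1 - beta) + 0.5 * beta."""
    p_one = x * (1.0 - beta) + 0.5 * beta
    return (rng.random(x.shape) < p_one).astype(np.int8)

# Illustrative binary user-item interaction vector (one user, 8 items).
x0 = np.array([1, 0, 0, 1, 1, 0, 1, 0], dtype=np.int8)

# Run T forward steps with an assumed constant noise schedule beta = 0.1.
x = x0.copy()
for t in range(200):
    x = forward_step(x, beta=0.1)
# After many steps, x is (close to) uniform Bernoulli(0.5) noise;
# the reverse model is trained to undo these flips step by step.
```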
Offline Evaluation of Reward-Optimizing Recommender Systems: The Case of Simulation
Both in academic and industry-based research, online evaluation methods are
seen as the gold standard for interactive applications like recommender
systems. Naturally, the reason for this is that we can directly measure utility
metrics that rely on interventions: the recommendations that are shown to
users. Nevertheless, online evaluation methods are costly for a number
of reasons, and a clear need remains for reliable offline evaluation
procedures. In industry, offline metrics are often used as a first-line
evaluation to generate promising candidate models to evaluate online. In
academic work, limited access to online systems makes offline metrics the de
facto approach to validating novel methods. Two classes of offline metrics
exist: proxy-based methods, and counterfactual methods. The first class is
often poorly correlated with the online metrics we care about, and the latter
class only provides theoretical guarantees under assumptions that cannot be
fulfilled in real-world environments. Here, we make the case that
simulation-based comparisons provide ways forward beyond offline metrics, and
argue that they are a preferable means of evaluation.Comment: Accepted at the ACM RecSys 2021 Workshop on Simulation Methods for
Recommender System
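In its simplest form, a simulation-based comparison replaces the online experiment with a synthetic user model that serves as the arbiter between candidate policies. Every modelling choice below (the Beta-distributed item attractiveness, the click model, the two policies) is an assumption for illustration, not the paper's simulator:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulator: each of 20 items has a latent attractiveness,
# and a user clicks a recommended item with that probability.
n_items = 20
attractiveness = rng.beta(2, 8, size=n_items)

def simulate(policy, n_users=50_000):
    """Average simulated click-through rate of a policy, where a
    policy is a function returning a recommended item index per user."""
    items = np.array([policy() for _ in range(n_users)])
    clicks = rng.random(n_users) < attractiveness[items]
    return clicks.mean()

# Two candidate policies compared inside the simulator.
best_item = int(np.argmax(attractiveness))
greedy = lambda: best_item                    # always show the best item
uniform = lambda: int(rng.integers(n_items))  # recommend uniformly at random

ctr_greedy = simulate(greedy)
ctr_uniform = simulate(uniform)
print(f"simulated CTR -- greedy: {ctr_greedy:.3f}, uniform: {ctr_uniform:.3f}")
```

Because the simulator's ground truth is known, the comparison is an intervention rather than a proxy, which is the property the abstract argues offline metrics lack.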